Web Page Downloading and Classification
Authors
Abstract
This paper describes the processes of downloading and classifying Web-based articles in online medical journals as a preliminary step to extracting bibliographic data to populate MEDLINE, the widely used database of the National Library of Medicine (NLM). The two processes are combined in an automated system named “Web Page Downloading and Classification”. The system downloads Web pages using WinInet, Microsoft’s Windows Internet API, together with several Artificial Intelligence (AI) techniques, including a Breadth-First search algorithm and a Constraint Satisfaction method. These two techniques are used to traverse each page’s links, classify the linked pages as abstract, full-text, PDF, or image files, and recognize and generate the successors of the pages being downloaded.
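The abstract does not include pseudocode, but the breadth-first link traversal with page classification it describes might be sketched as below. The URL-suffix rules in `classify` and the `links_of` callback are illustrative assumptions standing in for the paper's actual constraint-satisfaction rules and WinInet-based fetching:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical classification rules: label a link by its URL suffix or path.
PAGE_TYPES = {".pdf": "pdf", ".gif": "image", ".jpg": "image", ".png": "image"}

def classify(url: str) -> str:
    """Label a URL as abstract, full text, PDF, or image (illustrative rules)."""
    path = urlparse(url).path.lower()
    for suffix, label in PAGE_TYPES.items():
        if path.endswith(suffix):
            return label
    if "abstract" in path:
        return "abstract"
    return "fulltext"

def bfs_crawl(start_url: str, links_of, max_pages: int = 100) -> dict:
    """Breadth-first traversal: links_of(url) returns a page's outgoing links."""
    queue = deque([start_url])
    seen = {start_url}
    labels = {}
    while queue and len(labels) < max_pages:
        url = queue.popleft()
        labels[url] = classify(url)          # classify the current page
        for link in links_of(url):
            absolute = urljoin(url, link)    # resolve relative links
            if absolute not in seen:         # generate unseen successors
                seen.add(absolute)
                queue.append(absolute)
    return labels
```

The FIFO queue gives the breadth-first order; replacing `links_of` with a real HTTP fetch and HTML link extractor would turn the sketch into a working crawler.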
Similar Articles
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features, as well as to compute their weights, for web page classification. To evaluate the proposed approach, multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
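The core of such an approach is running PageRank over a graph of features; how that graph is built from the datasets is not given in the teaser, so the power-iteration sketch below assumes an already-constructed feature adjacency mapping and simply returns the resulting scores as feature weights:

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power iteration over a directed feature graph.

    adj maps each node to a list of nodes it links to (assumed input:
    e.g. a feature co-occurrence graph). Returns a score per node,
    usable as a feature weight or a ranking for feature selection.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # uniform start
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}  # teleport term
        for v in nodes:
            out = adj[v]
            if not out:
                continue                          # dangling node: no mass to pass
            share = damping * rank[v] / len(out)
            for w in out:
                new[w] += share                   # distribute rank along out-links
        rank = new
    return rank
```

Selecting the top-k features by score then yields the reduced feature subset.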
Demographic and motivation variables associated with Internet usage activities
Examines demographic variables (gender, age, educational level) and motivation variables (perceived ease of use, perceived enjoyment, perceived usefulness) associated with Internet usage activities (defined in terms of messaging, browsing, downloading and purchasing). A total of 1,370 usable responses were obtained using a Web page survey. Results showed that males are more likely to engage in ...
Automated Article Links Identification for Web-based Online Medical Journals
As part of research into Web-based document analysis, including Web page downloading and classification, an algorithm has been developed to automatically identify article links in Web-based online journals. This algorithm is based on feature vectors calculated from attributes and contents of links extracted from HTML files, and an instance-based learning algorithm using a nearest neighbor methodo...
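An instance-based, nearest-neighbor classifier of this kind can be sketched in a few lines; the two-component feature vectors below are hypothetical stand-ins for the link attributes (e.g. anchor-text statistics) the paper extracts from HTML:

```python
import math

def nearest_label(vec, examples):
    """1-nearest-neighbour classification over (feature_vector, label) pairs.

    examples is the training set of labelled link feature vectors;
    the query vec receives the label of its closest training example
    under Euclidean distance.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(examples, key=lambda ex: dist(vec, ex[0]))[1]
```

Because the model is just the stored examples, adding a newly labelled link requires no retraining, which suits incremental journal-by-journal processing.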
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download only domain-specific web pages; an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
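A focused crawler prioritizes its URL frontier so the most topic-relevant pages are fetched first. A minimal best-first sketch, assuming `relevance` and `links_of` callbacks as stand-ins for a real topic classifier and page fetcher:

```python
import heapq

def focused_crawl(seed_urls, relevance, links_of, budget=50):
    """Best-first crawl: always expand the URL with the highest relevance.

    relevance(url) -> float scores a URL's topical relevance;
    links_of(url) -> list returns its outgoing links.
    heapq is a min-heap, so scores are negated to pop the best URL first.
    """
    frontier = [(-relevance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)      # most relevant unvisited URL
        visited.append(url)
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited
```

The `budget` parameter caps the crawl, so low-relevance URLs may never be fetched at all, which is the efficiency gain focused crawling aims for.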
Analysis of Sources of Latency in Downloading Web Pages
Why does it take so long to download a Web page from a Web server? We analyze the download latency for pages for a variety of situations, in which Web browser and server are both within the same country as well as in different countries. Our study examines several sources of latency in accessing Web pages: DNS, TCP, the Web server itself, and the network links and routers. We divide the total d...
Journal:
Volume, Issue:
Pages: -
Publication date: 2001